Abstract

Which factors underlie human information production on a global level? To
gain insight into this question we study a corpus of 252-633 mil. publicly
available data files on the Internet, corresponding to an overall storage
volume of 284-675 Terabytes. Analyzing the file size distribution for several
distinct data types, we find indications that the neuropsychological capacity
of the human brain to process and record information may constitute the
dominant limiting factor for the overall growth of globally stored information,
with real-world economic constraints having only a negligible influence. This
supposition draws support from the observation that the file size distributions
follow a power law for data without a time component, like images, and a
log-normal distribution for multimedia files, for which time is a defining
quale.

Author Summary 

The generation of new information is limited by two key factors: the economic
costs incurred and the capacity of the human brain to process and store data
and information. The controlling agent needs to retain an overall understanding
even when data is generated by semi-automatic processes. These processes are
reflected in the statistical properties of the data files publicly available on
the Internet. Collecting a corpus of 252-633 mil. files, we find that the
statistics of the file size distribution are consistent with the supposition
that data production on a global level is shaped and limited by the
neuropsychological information processing capacity of the brain, with economic
and hardware constraints having a negligible influence.



Introduction 

Information production and storage become progressively easier. Moore's law
[1] states that technological advancements lead to a doubling of computing power
every 1.5 years, and data storage capacity increases by a factor of about 100
every 10 years [2]. Data production, whose goal is to increase knowledge
and information, is constrained on one side by the economic costs involved and
on the other side by the neuropsychological limitations and costs of the data
generating agents. Maximizing the total amount of information generated for
given amounts of economic and neuropsychological resources hence determines
the shape of the file-size distribution [3].

The economic costs for data production involving hardware, software and 
management are proportional to the amount of data produced. The overall goal 
of data production is the generation of information, which can be measured 
by Shannon's information entropy [3]. Maximization of information entropy 
under the constraint of economic costs leads to file size distributions having 
exponential tails [3] [5] [6]. Exponential tails are, however, absent both in our
data and in an earlier study of the file-size distribution on a large number of
Windows file systems on desktop computers [7]. The absence of exponential
tails for files hosted on Internet servers indicates that economic costs are not 
the limiting factors for data production. 

The ability of the human brain to process and record information determines 
a subjective value which the producing individual attributes to an information 
source. For example, the amount of information gained when increasing the
resolution of a low quality image is substantially higher than when increasing
the resolution of a high quality photo by the same degree. This relation is
known as the Weber-Fechner law and results from underlying neurophysiological
processes
[3 m [To] . We find that the observed file-size distributions on the Internet are 
consistent with the Weber-Fechner law and propose that neuropsychological 
constraints may be a dominant factor in shaping the statistics of global data 
production. This hypothesis is based on the finding that the distribution func- 
tions maximizing information entropy, given the neurophysiological constraints 
of the Weber-Fechner law, nicely reproduce the real world file-size distributions. 

The neurophysiological constraints resulting from the Weber-Fechner law
also imply that the maximal entropy distributions are qualitatively
different for data formats involving time, like audio and video, compared to file
types not involving time, as is the case for images. We find that these distinct
predictions are in very good agreement with the observed file-size distributions.

Maximal Entropy Distribution Functions 

Given a normalized distribution function P(s) for a corpus of data, in our case
the file-size distribution, its information content can be measured by Shannon's
information entropy [1], $-\sum_s P(s) \log(P(s))$. The overall goal of data
production, to attain an optimal information content, is achieved when the
respective information entropy is maximal.
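
As a small illustration of the entropy measure used here, the following sketch evaluates it for two made-up distributions over four file-size bins. The bin probabilities are hypothetical and serve only to show that a flat distribution carries more entropy than a peaked one.

```python
import math

def shannon_entropy(probs):
    """Return -sum_s P(s) log P(s) for a normalized distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Two hypothetical distributions over four file-size bins (illustrative only).
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.85, 0.05, 0.05, 0.05]

h_uniform = shannon_entropy(uniform)
h_peaked = shannon_entropy(peaked)
print(h_uniform, h_peaked)
```

Without any cost constraint the flat distribution is the entropy maximum; the constraints discussed next are what give the distribution a non-trivial shape.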

Figure 1: File-size distribution for 252 mil. files hosted in 7.7 mil. domains.

For all file types together and for the five Mime categories individually.
Displayed is the log10 of the number of files in bins of 1 Kbyte.

We denote by c(s) the cost function associated with the economic and
neurophysiological constraints, and by $\lambda$ the respective Lagrange
multiplier. The distribution function P(s) maximizing information entropy given
the constraint c(s) is determined by [U |6]

$$\delta A[P] = 0, \qquad A[P] = -\sum_s P(s) \log(P(s)) - \lambda \sum_s c(s) P(s), \qquad (1)$$

where $\delta A[P]$ denotes the variation of the functional $A[P]$ with respect
to distribution functions $P(s)$. One obtains from Eq. (1) that
$P(s) \sim \exp(-\lambda c(s))$. The maximal entropy distributions then have
the form

$$P(s) \sim \exp\left[ -\lambda_0 s - \lambda_1 \log(s) - \lambda_2 \log^2(s) \right], \qquad (2)$$

when considering cost functions containing terms proportional to the file size
$s$, to $\log(s)$ and to $\log^2(s)$. The first term, linear in the size of the
files $s$, corresponds to economic costs. The other two terms in the cost
function correspond to the scaling of neurophysiological resources.
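
The step from Eq. (1) to the exponential form can be spelled out as follows. This is a sketch under one added assumption: the normalization of P(s) is enforced by a second multiplier, here called $\mu$, which does not appear in the text above and is introduced only for this illustration.

```latex
% Functional with an explicit normalization multiplier \mu (assumed symbol)
A[P] = -\sum_s P(s)\log P(s) \;-\; \lambda \sum_s c(s)\,P(s) \;-\; \mu\Big(\sum_s P(s) - 1\Big)

% Stationarity with respect to each P(s)
\frac{\partial A}{\partial P(s)} = -\log P(s) - 1 - \lambda\,c(s) - \mu = 0

% Solving for P(s): the prefactor is fixed by normalization
P(s) = e^{-1-\mu}\, e^{-\lambda c(s)} \;\sim\; \exp\big(-\lambda\, c(s)\big)
```

With several cost terms, each term enters with its own multiplier, which gives the sum in the exponent of Eq. (2).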

The Weber-Fechner relations state that the neural representations of sen- 
sory stimuli [H], objects UHl E] and time perception [T^ in the brain scale 
logarithmically with the intensity of the stimuli, the number of objects and the 
length of the time interval respectively. The perceived costs and benefits of 
information generation and processing hence scale logarithmically with physical 
data volume. Maximization of information entropy under the logarithmic cost 
function yields a power-law file size distribution, as described by Eq. (2).

The perceived cost function furthermore scales as the square of the logarithm
whenever the data is characterized by two neurophysiologically distinct
degrees of freedom, such as resolution and time. The distribution maximizing
information entropy is then a log-normal file size distribution, see Eq. (2).
We find that this is indeed the case for multimedia files, such as audio and
video files, for which time is a defining quale. The file size distributions of
non-temporal data types (e.g. texts and images) follow, on the other hand, a
power law.

Figure 2: File hosting vs. in-degree. Main: The number of domains (dark
blue: few hosts, white: many hosts) with a given in-degree (x-axis) and hosting
a given number of files (y-axis); all Mime categories without text/ and image/.
Inset: The distribution of the in-degree for the 32 mil. hosts receiving
incoming links from the Wikipedia/Dmoz corpus.
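
The two limiting cases of Eq. (2) can be made explicit with a short worked calculation. It is a sketch that assumes the economic term is absent and writes $x = \log(s)$; the multipliers are named as in Eq. (2).

```latex
% Logarithmic cost only: a power law
P(s) \sim \exp\big[-\lambda_1 \log(s)\big] = s^{-\lambda_1}

% Logarithmic plus squared-logarithmic cost, with x = \log(s):
-\lambda_1 x - \lambda_2 x^2
  = -\lambda_2\Big(x + \frac{\lambda_1}{2\lambda_2}\Big)^2 + \frac{\lambda_1^2}{4\lambda_2}

% Hence a Gaussian in \log(s), i.e. a log-normal form
P(s) \sim \exp\Big[-\lambda_2\Big(\log(s) + \frac{\lambda_1}{2\lambda_2}\Big)^2\Big]
```

For $\lambda_2 > 0$ the Gaussian in $\log(s)$ is centered at $-\lambda_1/(2\lambda_2)$ with variance $1/(2\lambda_2)$, so a maximum at large file sizes requires a negative $\lambda_1$, that is, a positive coefficient of the linear $\log(s)$ term.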

Results 

We performed a large scale search of publicly available files on the Internet,
utilizing the spider of the file search engine FindFiles.net. For the corpus of
hosts to be crawled we selected the collection of all outgoing links in
Wikipedia.org and Dmoz.org, the open directory project, scanning in both cases
all available editions. We crawled, in a first effort, a total of 7.7 mil.
hosts, indexing 252 mil. data files. The resulting file size distribution is
presented in Fig. 1 in a log-log representation, spanning nine orders of
magnitude. The crawling effort was then continued in a second stage until a
corpus of 633 mil. files had been reached, which we used for a systematic study
of the statistical properties of the resulting file-size distribution.

Figure 3: File-size distribution for video and audio files (252 mil. files).

Solid lines are guides to the eye of quadratic form,
a * log(size) - b * log^2(size), where a, b > 0.

File Taxonomy 

Data files can be classified according to their Mime or Internet Media Types, e.g. 
a jpeg file has the Mime type image/jpeg within the Mime category image/. Five 
Mime categories make up about 99.9% of all data formats publicly accessible 
on the Internet, with application/ contributing 33.2%, audio/ 2.9%, image/ 
58.0%, text/ 5.1% and video/ 0.7% respectively to the total number of files 
in the Wikipedia/Dmoz corpus. The average number of files per host, the 
average file sizes (in Kbytes) and median file sizes (in Kbytes) are respectively 
(10.8|1312|136) for application/, (0.9|6733|1589) for audio/, (19.0|189|72) for 
image/, (1.7|3786|5) for text/ and (0.2|28912|5548) for video/. The average 
file size of 189 Kbytes for images in our data has seen an increase relative 
to the 15 Kbytes found in an earlier study [13]. The substantial difference 
between the respective means and medians is a consequence of the fat tails in 
the corresponding distribution functions, compare Fig. 1.
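
The gap between mean and median for fat-tailed data can be reproduced with synthetic numbers. The sketch below draws hypothetical "file sizes" from a Pareto distribution; the tail exponent 1.5 and the sample size are arbitrary choices for illustration and are not fitted to the corpus.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical fat-tailed "file sizes": Pareto samples with tail exponent 1.5
# (an illustrative choice, not a value taken from the data).
sizes = [random.paretovariate(1.5) for _ in range(100_000)]

mean_size = statistics.mean(sizes)      # pulled upwards by a few huge values
median_size = statistics.median(sizes)  # insensitive to the tail
print(mean_size, median_size)
```

A few very large samples dominate the mean while leaving the median almost unchanged, which is the same qualitative pattern as in the per-category numbers above.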

File size distribution 

Fig. 2 shows the correlation between the number of files hosted and the
in-degree (the number of inbound links) of the hosting domain. Important domains
tend to have a high in-degree [13], e.g. the in-degree of Twitter.com is 805000
in the Wikipedia/Dmoz corpus. The number of publicly accessible data files
hosted is however anti-correlated with the in-degree, most data being hosted on
relatively unknown hosts. The power law for the in-degree distribution presented
in the inset of Fig. 2 has remained remarkably constant for the World Wide Web
over the last decades. Our slope of -2.2 for the 32 mil. hosts within a
one-click distance of the Wikipedia/Dmoz corpus is very close to the slopes
between -1.94 and -2.1 found in previous studies [15] [16] [17].

Figure 4: File-size distribution for jpeg and gif images (252 mil. files).

The transition from a -2 to a -4 slope for the jpeg-file distribution occurs at
about 4 Mbyte. This kink can be attributed to the transition from amateur to
professional image production.

A manifest property of the file size distribution presented in Fig. 1 is the
absence of exponential tails, which one would have expected [3] [5] for an
information entropy production constrained by economic limitations, like costs
and available storage space. The lack of exponential tails has also been
observed in an earlier study of the file size distribution on a large number of
Windows file systems on desktop computers [7]. That study also found that the
utilization ratio of desktop hard disks is, on average, below capacity. Thus,
the full storage volume is rarely utilized by the average PC user.

A differentiated perspective can be obtained when examining the functional
form of the file size distributions for distinct Mime categories and types. The
tails for the video and audio file distributions, shown in Fig. 3, and the tails
for the file size distributions of jpeg and gif images, presented in Fig. 4,
differ manifestly.

The linear dependence observed in Fig. 4 corresponds to a scale-free power law
$P(s) \propto s^{-\gamma}$ of the file size distribution P(s), with distinct
slopes for lossless and lossy image compression formats, gif and jpeg,
respectively. For video and audio files the file size distribution follows a
log-normal dependence, with
$\log(P(s)) \propto \alpha \log(s) - \beta \log^2(s)$ fitting the data
excellently over more than 5 orders of magnitude. These two distributions differ
qualitatively in two aspects, namely in the occurrence of the quadratic term
$\log^2(s)$ for the log-normal distribution and in the sign of the linear term.
The leading term $-\gamma \log(s)$ has a negative slope for image data formats
and a positive slope $\alpha \log(s)$ for the audio/ and video/ Mime categories
(with $\alpha, \gamma > 0$). The log-normal dependence observed for video and
audio files is hence qualitatively distinct from a power-law scaling and cannot
be interpreted as a quadratic correction to a linear fit within a log-log data
analysis.

Figure 5: Power-law fit to the complementary cumulative file-size
distribution. The red dashed lines are the fits, the respective parameters are
given in the insets. For Mime categories audio/, video/ (top) having a time
component and application/, image/ (bottom) having no time component.
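
The qualitative difference between the two functional forms can be illustrated numerically. The following sketch is not the analysis used in this study; it evaluates the curvature of log P as a function of log(s) on exact synthetic curves, with hypothetical values for gamma, alpha and beta. A power law has zero curvature in a log-log representation, a log-normal form has the constant curvature -2*beta.

```python
def log_log_curvature(log_s, log_p):
    """Average second finite difference of log P on an equally spaced log(s) grid."""
    h = log_s[1] - log_s[0]
    second = [
        (log_p[i - 1] - 2.0 * log_p[i] + log_p[i + 1]) / h**2
        for i in range(1, len(log_p) - 1)
    ]
    return sum(second) / len(second)

# Hypothetical parameters, chosen only for illustration.
gamma, alpha, beta = 2.0, 3.0, 0.25

log_s = [0.5 * i for i in range(1, 30)]                 # equally spaced in log(s)
power_law = [-gamma * x for x in log_s]                 # log P = -gamma log(s)
log_normal = [alpha * x - beta * x**2 for x in log_s]   # log P = alpha log(s) - beta log^2(s)

curv_power_law = log_log_curvature(log_s, power_law)
curv_log_normal = log_log_curvature(log_s, log_normal)
print(curv_power_law, curv_log_normal)
```

On real histograms the second differences would be noisy, so a fit over the whole tail, as described in the fitting section, is the more robust choice.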

The fact that the file size distributions and the distribution tails are
qualitatively different for multimedia and image file formats strongly indicates
that they are determined by the underlying neurophysiological limitations of the
producing agents. The cost functions are therefore, see Eq. (2), proportional
to log(s) and log^2(s) for data characterized by one and two degrees of freedom,
respectively.
Figure 6: Log-normal fit to the complementary cumulative file-size
distribution. The red dashed lines are the fits, the respective parameters are
given in the insets. For Mime categories audio/, video/ (top) having a time
component and application/, image/ (bottom) having no time component.



Fitting methods 

The guide-to-the-eye fits shown in Figs. 3 and 4 indicate that the statistical
properties of the file-size distributions depend on the presence or absence of a
time component.

For a systematic analysis we used the corpus of 633 mil. files, containing
four Mime categories, image/ (64.8%), application/ (31%), audio/ (3.5%) and
video/ (0.7%). The evaluation of the file size distributions was performed in
two steps. In a first step the files were binned into 1 Kbyte bins. In a second
step we evaluated maximum likelihood estimates for two model distributions [18].
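
The first step, binning into 1 Kbyte bins, can be sketched as follows. The file sizes are made up, and the convention of 1024 bytes per Kbyte is an assumption, since the text does not state which convention was used.

```python
from collections import Counter

BYTES_PER_KBYTE = 1024  # assumption: binary Kbytes; the text does not specify this

def bin_file_sizes(sizes_in_bytes):
    """Count files per 1 Kbyte bin; bin k holds sizes in [k, k+1) Kbytes."""
    return Counter(size // BYTES_PER_KBYTE for size in sizes_in_bytes)

# Hypothetical file sizes in bytes, for illustration only.
sizes = [100, 900, 1024, 1500, 4096, 4100, 250_000]
histogram = bin_file_sizes(sizes)
print(sorted(histogram.items()))
```

The resulting integer bin index k is the discrete variable to which the model distributions are fitted.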

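We analyzed the tails of the respective file size distributions with two types
of discrete probability distributions, a power law,

$$p(k) = \frac{k^{-\gamma}}{\sum_{k'=k_{\min}}^{\infty} k'^{-\gamma}},$$

and one having a log-normal form,

$$p(k) = \frac{\exp\left[\alpha \log(k) - \beta \log^2(k)\right]}{\sum_{k'=k_{\min}}^{\infty} \exp\left[\alpha \log(k') - \beta \log^2(k')\right]}.$$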

In both cases a lower bound k_min is introduced as a free parameter, as we do
not expect either a power-law or a log-normal distribution to describe the whole
range of the data, but only the tails.

The actual fitting procedure consists of the following steps:

• We first performed maximum likelihood estimates for lower bounds
k_min in the range from 1 KB up to 100 MB.

• We then selected the k_min which minimizes the residual sum of squares
(rss) of the differences between the empirical and the fitted tails of the
complementary cumulative distribution functions, that is

$$\mathrm{rss} = \sum_{k' \geq k_{\min}} \left( \Pr(k \geq k') - F(k') \right)^2, \qquad (3)$$

with $F(k) = \sum_{k'=k}^{\infty} p(k')$ being the complementary cumulative
distribution of the model and $\Pr(k \geq k')$ the empirical complementary
cumulative file distribution.
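
The two steps above can be sketched for the power-law case as follows. This is a minimal illustration under stated assumptions, not the code used for the study: the data are synthetic, the support is truncated at a hypothetical `k_max`, the exponent is found by a simple grid search rather than a numerical optimizer, and the empirical tail is normalized to the files with k >= k_min. The function names `fit_gamma` and `tail_rss` are made up for this sketch.

```python
import math
import random
from collections import Counter

def fit_gamma(counts, k_min, k_max):
    """Grid-search maximum likelihood estimate of gamma for p(k) ~ k^-gamma
    on the truncated support k_min..k_max (the truncation is an assumption)."""
    tail = {k: n for k, n in counts.items() if k_min <= k <= k_max}
    n_tail = sum(tail.values())
    sum_log = sum(n * math.log(k) for k, n in tail.items())
    best_gamma, best_ll = None, -math.inf
    for step in range(101, 400):                      # gamma = 1.01 ... 3.99
        gamma = step / 100.0
        log_z = math.log(sum(k ** -gamma for k in range(k_min, k_max + 1)))
        log_likelihood = -gamma * sum_log - n_tail * log_z
        if log_likelihood > best_ll:
            best_gamma, best_ll = gamma, log_likelihood
    return best_gamma

def tail_rss(counts, gamma, k_min, k_max):
    """Residual sum of squares between empirical and model CCDF, cf. Eq. (3)."""
    n_tail = sum(n for k, n in counts.items() if k_min <= k <= k_max)
    z = sum(k ** -gamma for k in range(k_min, k_max + 1))
    empirical = model = rss = 0.0
    for k in range(k_max, k_min - 1, -1):             # accumulate from the top
        empirical += counts.get(k, 0) / n_tail
        model += k ** -gamma / z
        rss += (empirical - model) ** 2
    return rss

# Synthetic data: 20000 hypothetical bin indices drawn from k^-2 on 1..500.
random.seed(1)
k_max = 500
support = list(range(1, k_max + 1))
sample = random.choices(support, weights=[k ** -2.0 for k in support], k=20_000)
counts = Counter(sample)

fits = {}
for k_min in (1, 2, 5, 10):                           # candidate lower bounds
    gamma_hat = fit_gamma(counts, k_min, k_max)
    fits[k_min] = (gamma_hat, tail_rss(counts, gamma_hat, k_min, k_max))

best_k_min = min(fits, key=lambda k: fits[k][1])      # smallest rss wins
print(best_k_min, fits[best_k_min])
```

Note that the rss sums over fewer terms for larger k_min, so the selection criterion favors fits that hold over a long tail only in combination with a visual check of the fitted range.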

We present the best fits for the four Mime categories (image/, application/,
audio/ and video/) for a power-law distribution in Fig. 5 and for a log-normal
distribution in Fig. 6. In obtaining the maximum likelihood estimates for the
model parameters we have neglected files larger than 10 Gbytes; there are only
very few of these extremely big files, and they are hence statistically not
representative.

Comparing the two fits for audio/ and video/ data we find that the log-normal
distribution describes the empirical data substantially better. The rss values
are one to two orders of magnitude lower in the case of the log-normal fit (see
Table 1). The log-normal fit is also able to describe a broader range of the
data than the power-law fit (compare Fig. 5 and Fig. 6).

Similarly, a power-law fit for the application/ file-size distribution describes
a broader range of the empirical data, and has an rss value an order of
magnitude smaller than the one obtained for a log-normal fit (Table 1). In the
case of the image file types the evidence in favor of a power-law distribution
is not particularly strong, a consequence of the kink at around 4 Mbytes,
compare Fig. 4. Both fits, log-normal and power-law, describe a similar data
range and the corresponding rss values are of similar magnitude.



Table 1: Residual sum of squares, rss, estimated as the sum of squared
differences between empirical and fitted complementary cumulative distributions
(see Eq. (3)).

                power-law fit   log-normal fit
image/                    0.7              1.0
application/              3.3             47.9
audio/                   26.4              2.1
video/                  451.4              2.7
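
The comparison expressed by Table 1 can be written down as a small sketch. The rss values are copied from the table; the helper `preferred_model` is a hypothetical name used only here, and it simply picks the model with the smaller rss.

```python
# rss values copied from Table 1: (power-law fit, log-normal fit)
rss_table = {
    "image/": (0.7, 1.0),
    "application/": (3.3, 47.9),
    "audio/": (26.4, 2.1),
    "video/": (451.4, 2.7),
}

def preferred_model(power_law_rss, log_normal_rss):
    """Return the model with the smaller rss and the ratio larger/smaller."""
    if power_law_rss < log_normal_rss:
        return "power-law", log_normal_rss / power_law_rss
    return "log-normal", power_law_rss / log_normal_rss

summary = {cat: preferred_model(*rss) for cat, rss in rss_table.items()}
for cat, (model, ratio) in summary.items():
    print(f"{cat:13s} {model:10s} rss ratio {ratio:6.1f}")
```

The ratio makes the strength of the preference explicit: it is large for application/, audio/ and video/, and close to one for image/.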
Discussion



For images the production costs are functionally dependent on one variable, the 
resolution, which defines, modulo compression algorithms, the file size. The 
cost function for the production of videos depends however on two distinct 
quantities, the resolution per frame and the total number of frames, viz. the
time needed to shoot the sequence. The same holds for audio files, with frequency
resolution and length being the two cost-defining quantities. The cost functions
associated with information production are hence one- and two-dimensional for 
images and audio/video formats respectively. We generically observe in our data 
that one-dimensional cost functions result in power-law file size distributions, 
two-dimensional cost functions in log-normal distributions. 

For compound Mime categories or types, like text/, superpositions of these
two basic distributions are observed. This correlation between dimensionality
of data type and resulting file size distribution, which can be seen in Fig. 5
and Fig. 6, finds a straightforward rationale when accounting for the
neuropsychological constraints for data processing.

Our analysis is based on the assumption that an ensemble average over 
many information producing agents reveals the underlying information theoret- 
ical principles driving data production on a global level. Other studies have 
investigated alternative approaches, like the study of microscopic models
capable of generating distributions with large tails, such as scale-free
[19] [20], log-normal [HI and double Pareto-lognormal distributions [53]. In a
related context a log-normal distribution has been found for the distribution of
city sizes and has been related to proportionate growth mechanisms [24].

Both approaches, the modelling of generative processes and the information
theoretical perspective, are complementary and do not exclude each other.
Ultimately it may be possible to derive classes of microscopic generative models
from comprehensive information theoretical principles, as has been proposed,
e.g., for intrinsic neural adaptation rules generating information entropy
maximizing firing rate distributions [5] [25].